Importing Libraries and Loading Data

This cell imports pandas and loads the US COVID-19 confirmed-cases dataset from a CSV file.

In [4]:
import pandas as pd

# Load the dataset
file_path = 'time_series_covid19_confirmed_US.csv'
data = pd.read_csv(file_path)

The next cell displays the first five rows and the column labels of the dataset.

In [5]:
# Display the first few rows and the columns of the dataset
data.head(), data.columns
Out[5]:
(        UID iso2 iso3  code3    FIPS   Admin2 Province_State Country_Region  \
 0  84001001   US  USA    840  1001.0  Autauga        Alabama             US   
 1  84001003   US  USA    840  1003.0  Baldwin        Alabama             US   
 2  84001005   US  USA    840  1005.0  Barbour        Alabama             US   
 3  84001007   US  USA    840  1007.0     Bibb        Alabama             US   
 4  84001009   US  USA    840  1009.0   Blount        Alabama             US   
 
          Lat      Long_  ... 2/28/23  3/1/23  3/2/23  3/3/23  3/4/23  3/5/23  \
 0  32.539527 -86.644082  ...   19732   19759   19759   19759   19759   19759   
 1  30.727750 -87.722071  ...   69641   69767   69767   69767   69767   69767   
 2  31.868263 -85.387129  ...    7451    7474    7474    7474    7474    7474   
 3  32.996421 -87.125115  ...    8067    8087    8087    8087    8087    8087   
 4  33.982109 -86.567906  ...   18616   18673   18673   18673   18673   18673   
 
    3/6/23  3/7/23  3/8/23  3/9/23  
 0   19759   19759   19790   19790  
 1   69767   69767   69860   69860  
 2    7474    7474    7485    7485  
 3    8087    8087    8091    8091  
 4   18673   18673   18704   18704  
 
 [5 rows x 1154 columns],
 Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
        'Country_Region', 'Lat', 'Long_',
        ...
        '2/28/23', '3/1/23', '3/2/23', '3/3/23', '3/4/23', '3/5/23', '3/6/23',
        '3/7/23', '3/8/23', '3/9/23'],
       dtype='object', length=1154))
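
The output above shows a wide layout: metadata columns first, then one column per date. Later cells assume that the last column is the latest date. As a minimal sketch of a more defensive alternative, the column names can be parsed as dates and the most recent one selected. The `toy` frame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame mimicking the wide layout:
# metadata columns plus one column per date (m/d/yy).
toy = pd.DataFrame({
    'Admin2': ['A', 'B'],
    'Province_State': ['X', 'Y'],
    '3/8/23': [10, 20],
    '3/9/23': [12, 25],
})

# Parse every column name as a date; names that are not dates become NaT.
parsed = pd.to_datetime(
    pd.Series(toy.columns, index=toy.columns),
    format='%m/%d/%y',
    errors='coerce',
)
date_columns = parsed.dropna()

# The latest date column is the label whose parsed date is largest.
latest_date = date_columns.idxmax()
print(latest_date)
```

Selecting the column this way keeps working even if extra columns are appended to the frame after the date columns.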

Data Manipulation and Outlier Detection

This cell takes the latest date column in the dataset and uses it as the total confirmed cases for each region as of that date. It then detects outliers with the interquartile range (IQR) rule: a region is an outlier if its case count lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles. The output previews the first few outliers and reports how many were found.

In [6]:
# Extracting the latest date from the dataset to calculate the total confirmed cases up to that date
latest_date = data.columns[-1]  # Assumes the last column is the latest date

# Calculate total cases for each region as of the latest date
data['Total_Cases_Latest'] = data[latest_date]

# Using the Interquartile Range (IQR) to detect outliers
Q1 = data['Total_Cases_Latest'].quantile(0.25)
Q3 = data['Total_Cases_Latest'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers as regions where the total cases are beyond 1.5 times the IQR from the Q1 or Q3
outlier_condition = ((data['Total_Cases_Latest'] < (Q1 - 1.5 * IQR)) |
                     (data['Total_Cases_Latest'] > (Q3 + 1.5 * IQR)))

outliers = data.loc[outlier_condition, ['Admin2', 'Province_State', 'Total_Cases_Latest']]
outliers.head(), outliers.shape
Out[6]:
(       Admin2 Province_State  Total_Cases_Latest
 1     Baldwin        Alabama               69860
 36  Jefferson        Alabama              238727
 40        Lee        Alabama               47646
 44    Madison        Alabama              116086
 48     Mobile        Alabama              134986,
 (442, 3))
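
To make the IQR rule concrete, here is a small worked sketch on made-up numbers. The helper name `iqr_outlier_mask` and the values in `toy_counts` are illustrative assumptions, not part of the notebook. It applies the same fences as the cell above.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical case counts: six similar values and one very large one.
toy_counts = pd.Series([10, 12, 13, 15, 16, 18, 200])

# Q1 = 12.5, Q3 = 17.0, IQR = 4.5, so the fences are 5.75 and 23.75.
mask = iqr_outlier_mask(toy_counts)
print(toy_counts[mask])  # only the value 200 is flagged
```

Because case counts cannot be negative, the lower fence only flags anything when Q1 - 1.5 × IQR is positive. For strongly right-skewed counts it is often negative, in which case every outlier lies above the upper fence.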

Aggregating and Analyzing State Data

This cell aggregates the total COVID-19 cases by state and calculates the mean and standard deviation of the state totals. It then flags as outliers the states whose totals are more than two standard deviations from the mean, and plots all states with the outliers highlighted.

In [7]:
import plotly.express as px
import pandas as pd
import numpy as np

# Assuming 'data' is already loaded and contains a column 'Province_State' for states
# and 'Total_Cases_Latest' for the latest total cases

# Aggregate total cases by state
state_cases = data.groupby('Province_State')['Total_Cases_Latest'].sum().reset_index()

# Calculate the mean and standard deviation for the cases
mean_cases = state_cases['Total_Cases_Latest'].mean()
std_cases = state_cases['Total_Cases_Latest'].std()

# Identify outliers (e.g., cases that are more than 2 standard deviations from the mean)
state_cases['Outlier'] = np.abs(state_cases['Total_Cases_Latest'] - mean_cases) > 2 * std_cases

# Create an interactive scatter plot
fig = px.scatter(state_cases, x='Province_State', y='Total_Cases_Latest',
                 color='Outlier',  # Boolean column, so Plotly assigns discrete colors: True for outlier, False otherwise
                 hover_name='Province_State',  # Shows state name on hover
                 hover_data={'Total_Cases_Latest': True, 'Outlier': False},  # Shows cases on hover, hide outlier boolean
                 labels={'Total_Cases_Latest': 'Total Cases', 'Province_State': 'State'},
                 title='Interactive Scatter Plot of Total COVID-19 Cases for Each State Highlighting Outliers')

# Show the plot
fig.show()
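
The two-standard-deviation rule used above can be checked on a small example. The following is a sketch on made-up totals; the state labels `S1` to `S8` and their values are hypothetical. Like the cell above, it relies on the pandas default sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical state totals: seven similar values and one much larger one.
toy_states = pd.DataFrame({
    'Province_State': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8'],
    'Total_Cases_Latest': [100, 110, 120, 130, 140, 150, 160, 1000],
})

mean_cases = toy_states['Total_Cases_Latest'].mean()  # 238.75
std_cases = toy_states['Total_Cases_Latest'].std()    # sample std, about 308.2

# Flag states whose total is more than 2 standard deviations from the mean.
deviation = (toy_states['Total_Cases_Latest'] - mean_cases).abs()
toy_states['Outlier'] = deviation > 2 * std_cases

print(toy_states[toy_states['Outlier']])  # only S8 is flagged
```

Note that a single extreme value inflates both the mean and the standard deviation, so with few rows this rule flags fewer points than the IQR rule would on the same data.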
In [8]:
import plotly.express as px

# Sort the outliers by 'Total_Cases_Latest' in descending order and take the top 10
# (.copy() so that adding columns below does not touch a view of 'outliers')
outliers_sorted = outliers.sort_values(by='Total_Cases_Latest', ascending=False).head(10).copy()

# Convert 'Total_Cases_Latest' to millions
outliers_sorted['Total_Cases_Latest_Millions'] = outliers_sorted['Total_Cases_Latest'] / 1_000_000

# Build a "County, State" label for the y-axis
outliers_sorted['Region'] = outliers_sorted['Admin2'] + ", " + outliers_sorted['Province_State']

# Create an interactive bar chart for the top 10 outlier regions
# (labels are keyed by column name; an 'x' key would not rename the x-axis here)
fig = px.bar(outliers_sorted, x='Total_Cases_Latest_Millions', y='Region',
             labels={'Total_Cases_Latest_Millions': 'Total Cases (millions)', 'Region': 'Region'},
             title='Top 10 Total COVID-19 Cases in Outlier Regions (Millions)')

# Display the plot
fig.show()

This notebook analyzes COVID-19 confirmed cases in the U.S. and identifies outliers at two levels: individual regions, using the interquartile range (IQR) rule, and states, using a two-standard-deviation threshold.

Key observations: regions and states whose case counts deviate strongly from the typical range are flagged as outliers, which directs attention to areas with unusual totals. The scatter plot shows the distribution of state totals and highlights the outlier states, and the bar chart ranks the ten outlier regions with the most cases. This analysis can help identify areas that might need more intensive public health interventions or resource allocation because of their atypical case numbers.